Introduction to Open Data Science - Week 1, Intro

1.1 I am rather excited to start this course, although I am a bit worried that the workload will be too much for me. I hope, though, that if I manage the workload, I will have the skills to start independently developing as a user of R and a contributor to open data science. It is especially the latter point that brought me to this course, although I did not actively look for it; I simply happened to run into it on the course pages of the University of Helsinki.

1.2 In what has been a rather unpleasant experience, GitHub did not immediately work for me. Nothing seemed to upload to my diary. Now, I think, I have managed to overcome my issues and get things to work most of the time. Learning by doing, I suppose!

To summarize:
- I want to learn open data science.
- I am worried the workload will be too much.
- I look forward to the course.
- I hope I will figure out the specifics of Git and RStudio.

Here’s the link to my GitHub repository.


Introduction to Open Data Science - Week 2, Regression

Libraries:

library(dplyr)
library(ggplot2)
library(GGally)

This chapter analyses a selection of data from a 2014 survey of students participating in an introductory statistics course in Finland. The survey mapped students’ learning approaches and learning achievements. While the original data contained 183 observations of 60 variables, a more limited dataset of 166 observations of 7 variables is employed here. These variables are the age and gender of the participants, their points from the course (representing their performance), their attitude towards the course, and three variables mapping their learning styles. These learning styles are the “surface approach,” indicating memorization without deeper engagement, the “deep approach,” indicating an intention to maximize understanding of the subject matter, and the “strategic approach,” indicating an approach aimed at maximizing the student’s chance at a good grade. The variables “attitude,” “surface approach,” “deep approach,” and “strategic approach” are all aggregate mean measures of other variables; each summarizes a set of related questionnaire items into an average. This analysis used the below script, in combination with existing knowledge, to interpret the dataset:

Learn2014 <- read.table("Data/Learn2014", header = TRUE, sep = "\t")
# gender is coded 0/1 in the file; store it as a factor
Learn2014$gender <- factor(Learn2014$gender, levels = c("0", "1"))
str(Learn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age   : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ gender: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 2 ...
##  $ attit : num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep  : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ strat : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Points: int  25 12 24 10 22 21 21 31 24 26 ...
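
For context, aggregate measures like attit, deep, surf and strat are typically formed as row means of the related Likert items. A purely hypothetical sketch (the item names and values below are invented, since the raw survey data is not part of this analysis):

```r
# Hypothetical: average a set of related Likert items into one mean score
raw <- data.frame(D1 = c(4, 3), D2 = c(5, 2), D3 = c(3, 4))  # invented items
raw$deep <- rowMeans(raw[, c("D1", "D2", "D3")])             # aggregate mean
```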

The below graphs and summaries of the data help us gain an initial picture of the trends present therein. For one, we can see that the vast majority of students participating in this survey were female (110 women v. 56 men), with a mean age of about 25.5 years and approximately 75% of students aged 27 or younger.

As for the variables related to studying, all of them approximate a normal distribution, although with a slight skew to the right. Certain immediately interesting pieces of information arise from the correlation numbers. Firstly, a positive attitude is strongly correlated with higher points, while the deep approach counter-intuitively seems to have little effect on performance. Nevertheless, the surface approach seems to predict a slightly worse performance, while the strategic approach predicts a slightly better one. Curiously, age among men seems to predict worse performance, although this might be due to two outliers. We shall next test these initial findings with a multiple linear regression.

Graph_AgeGeN <- ggpairs(Learn2014, columns = c(1, 2), legend = 3, title = "Age and Gender", mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))

Graph_AgeGenPoints <- ggpairs(Learn2014, columns = c(1,2,7), title = "Effects of Age and Gender on Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))

Graph_PredPoints <- ggpairs(Learn2014, columns = c(3:7), title = "Attitude, Study Style and Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGeN

Graph_AgeGenPoints

Graph_PredPoints

summary(Learn2014)
##       Age        gender      attit            deep            surf      
##  Min.   :17.00   0: 56   Min.   :1.400   Min.   :1.583   Min.   :1.583  
##  1st Qu.:21.00   1:110   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.417  
##  Median :22.00           Median :3.200   Median :3.667   Median :2.833  
##  Mean   :25.51           Mean   :3.143   Mean   :3.680   Mean   :2.787  
##  3rd Qu.:27.00           3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.167  
##  Max.   :55.00           Max.   :5.000   Max.   :4.917   Max.   :4.333  
##      strat           Points     
##  Min.   :1.250   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:19.00  
##  Median :3.188   Median :23.00  
##  Mean   :3.121   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :33.00

For the below multiple linear regression, three predictor variables have been chosen: attitude, the surface approach, and the strategic approach. These variables were chosen due to their relatively higher correlations compared to the other available variables (age for males is excluded due to the presence of outliers skewing the calculation). The regression shows that only attitude has a statistically significant impact on points, as it is the only independent variable with a p-value below 0.05. In the case of attitude, the probability of observing a relationship this strong if the null hypothesis (attitude has no effect on points) were true is less than 0.1 percent. Not only is attitude a statistically significant predictor of points, it also seems to have a strong impact, with a beta coefficient of approximately 3.4. This means that with each 1-point step towards a better attitude on the Likert scale, points rise by approximately 3.4.

For the remaining variables, the corresponding probability is above 5 percent, the conventional cut-off for statistical significance. This interpretation is also supported by the t-values, which conventionally are expected to be greater than 2, or less than -2, to indicate statistical significance. Altogether, this model nevertheless explains only approximately 20% of the variation in points, meaning that it is not a very good predictive model.

Points_regression <- lm(Points ~ attit + strat + surf, data = Learn2014)
summary(Points_regression)
## 
## Call:
## lm(formula = Points ~ attit + strat + surf, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attit         3.3952     0.5741   5.913 1.93e-08 ***
## strat         0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
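
The interpretation of the attitude coefficient can be illustrated with predict(); a quick sketch (the values at which strat and surf are held constant are arbitrary choices of mine):

```r
# Predicted points for attitudes 3 and 4, holding the other predictors fixed;
# the difference between the two predictions equals the attit coefficient (~3.4)
predict(Points_regression,
        newdata = data.frame(attit = c(3, 4), strat = 3, surf = 3))
```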

To further test the significance of attitude, the surface approach and strategic approach variables will be removed and a simple linear regression carried out with attitude as the only predictor. This produced no novel results; with the dropped variables, the explanatory power of the model (multiple R-squared) fell from 0.21 to 0.19. This means that changes in students’ attitude can help explain 19% of the variation in students’ scores. The fact that the reduction is so minor is further indication of the minor impact of the surface approach and strategic approach variables. To play around a bit, I have also included a multiple linear regression with age added. This, too, had no real effect on the model; the slight rise in multiple R-squared is expected whenever a predictor is added, which is why adjusted R-squared is the fairer comparison.

Points_Attit_Reg <- lm(Points ~ attit, data = Learn2014)
summary(Points_Attit_Reg)
## 
## Call:
## lm(formula = Points ~ attit, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attit         3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
AgeAttit_Regression <- lm(Points ~ attit + Age, data = Learn2014)
summary(AgeAttit_Regression)
## 
## Call:
## lm(formula = Points ~ attit + Age, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3354  -3.3095   0.2625   4.0005  10.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.57244    2.24943   6.034 1.04e-08 ***
## attit        3.54392    0.56553   6.267 3.17e-09 ***
## Age         -0.07813    0.05315  -1.470    0.144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1913 
## F-statistic: 20.52 on 2 and 163 DF,  p-value: 1.125e-08

To validate the model, this final section will produce three plots to test whether the assumptions of a regression model are met by the data. For this validation, the simple linear regression model Points_Attit_Reg will be used, as it is the most efficient of the models produced. The below graphs, “Residuals vs. Fitted,” “Normal Q-Q,” and “Residuals vs. Leverage,” test whether the errors are normally distributed, whether their variance is constant, and whether individual observations exert undue leverage on the model.

The Q-Q plot tests whether the errors are normally distributed. The below graph shows that the dots fit reasonably well on the line, although towards the more extreme quantiles the distribution shows signs of being leptokurtic and as such might not be normally distributed. Nevertheless, this analysis interprets the distribution of errors as approximately normal.

The Residuals vs. Fitted graph tests the assumption of constant variance of errors by plotting residuals against predicted values. As we can see no discernible pattern in the data, we can interpret the graph as showing no indication of the size of the error depending on the predicted value. Thus, constant variance of errors is established.

The Residuals vs. Leverage graph shows that none of the data points have an unreasonably high power to pull the model’s predictions towards themselves. This means that there are no influential outliers in the dataset. This, in combination with the above tests, indicates that the model is valid, as it adheres to the core assumptions of linear regression.

plot(Points_Attit_Reg, which = c(1,2,5))


Introduction to Open Data Science - Week 3, Logistic Regression

Libraries:

library(dplyr)
library(ggplot2)
library(GGally)
library(boot)

2. and 3.
Data Description with Variable Selection and Justification

The below-glimpsed dataset, “TheData,” contains the questionnaire answers of 382 students from two Portuguese secondary schools. The answers were given by students attending maths and Portuguese language courses; each group produced its own dataset, and the two have here been combined into one. In the process of combining the data, observations were selected so that 13 identifying variables contain no empty values. This has reduced the data from 1044 to 382 observations.
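
The combining step could be sketched roughly as follows (the file names and the exact identifier columns are assumptions on my part, based on the description above; the original script is not shown in this diary):

```r
# Join the maths and Portuguese datasets on the identifying columns;
# inner_join keeps only students present in both, so no empty matches remain
library(dplyr)
math <- read.csv("student-mat.csv", sep = ";")
por  <- read.csv("student-por.csv", sep = ";")
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
TheData <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```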

The questionnaire was created to predict the target variable “G3,” i.e., the final grade of the student attending the course. Accordingly, the variables can be said to have at least a potential link to school performance, although some (such as whether the student lives in an urban or rural area) arguably have a more tenuous theoretical link than others (such as whether the student receives additional educational support). A glimpse of the data is provided below:

## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F"…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "L…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servi…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other",…
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation",…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother"…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 1…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, …
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0…
## $ alcoholics <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

For the purposes of this analysis, four variables relating to alcohol use have been selected. The primary purpose of this analysis is to examine the effects of alcohol use on the final grade. Accordingly, the variable “G3,” the final grade on a 20-point scale, is a given. As is “alc_use,” the variable mapping alcohol use on a five-point scale where “1” indicates very low consumption and “5” very high consumption (this variable is the mean of the student’s alcohol consumption on weekdays and weekends, mapped by the variables “Dalc” and “Walc,” respectively). The hypothesized relationship between “G3” and “alc_use” is that higher consumption of alcohol predicts lower achievement in school, as represented by G3. Furthermore, it is hypothesized that the mechanism that might explain any potential causal relationship is the number of absences (measured in days out of 93), arising from either reduced energy or hangovers caused by heavier drinking. (I realize that this is a bold assumption to make before examining the relationship between alcohol use, absences and the final grade, but the task requires naming the four variables now.) The benefit of this causal explanation is that it does not require knowledge of the effects of alcohol use on the brain, nor does it demand that the high use be long-term, a common qualifier with alcohol-related learning difficulties, but something for which the dataset contains no data.

As the working theory is that higher alcohol use has a negative effect on school performance, it is also useful to theorize about the reasons behind higher alcohol use. Here two variables are examined: “freetime,” i.e., how much free time the student has in a week on a five-point scale (1 denoting very little, 5 very much), and “famrel,” i.e., how good the student’s relationship with their family is on a five-point scale (1 denoting a very bad relationship, 5 an excellent one). The theorized relationships are as follows: the more free time one has, the more one drinks to pass the time, and the worse one’s relationship with one’s family, the more one drinks for comfort (the same caveat applies here as with the previous relationship). These are the relationships that will be explored below: A) the effects of alcohol use on the final grade; B) the effects of alcohol use on absences and the effects of absences on the grade; C) the effects of free time on alcohol use; D) the effects of family relations on alcohol use. Any further interesting relationships will be explored as warranted by the initial results (such as the effect of family relations, given a lot of free time, on alcohol use).

4.
Numerical and graphical exploration of relationships A through D.

A and B

The above set of graphs explores the relationships between alcohol use, absences and the final grade. The results have again been divided by sex, in the spirit of last week. A few noteworthy points can be noticed immediately. Firstly, there seems to be, overall, no statistically significant relationship between the number of absences and the final grade; this, if anything, is a troubling result for Portuguese teachers. Admittedly, among males there seems to be a weakly significant relationship. On the other hand, alcohol use would seem to predict both more absences and lower scores, although here too the difference between males and females is considerable.

Since there is no theoretical reason for this division, it raises some questions about the data. As such, before delving further into the numbers, we need to examine the data to see whether the variation between sexes can be explained by abnormalities in the observations. Two observations immediately jump out: in the column where absences are on the y-axis, we can note two observations, both female, that could constitute outliers. To examine this further, we will carry out a regression analysis where absences explain the final score, and another where alcohol use explains absences. Both analyses will then be subjected to the residuals vs. leverage test from last week, which will indicate whether some of the data points have an unreasonably high power to pull the models’ predictions towards themselves.
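
The two models referred to above could be fit as follows (a sketch; this diary does not echo the original code, and the object names are my own):

```r
# Regressions for relationships A and B, plus their leverage diagnostics
abs_grade <- lm(G3 ~ absences, data = TheData)       # absences explaining grade
alc_abs   <- lm(absences ~ alc_use, data = TheData)  # alcohol use explaining absences
summary(abs_grade)
summary(alc_abs)
plot(abs_grade, which = 5)  # Residuals vs Leverage
plot(alc_abs, which = 5)
```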

## 
## Call:
## lm(formula = G3 ~ absences, data = TheData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7235  -1.6055   0.3355   2.3355   6.8688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.72350    0.21879  53.583   <2e-16 ***
## absences    -0.05897    0.03094  -1.906   0.0574 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.299 on 380 degrees of freedom
## Multiple R-squared:  0.009471,   Adjusted R-squared:  0.006865 
## F-statistic: 3.634 on 1 and 380 DF,  p-value: 0.05738

## 
## Call:
## lm(formula = absences ~ alc_use, data = TheData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.417 -3.442 -1.442  1.576 41.558 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2523     0.5918   3.806 0.000165 ***
## alc_use       1.1901     0.2779   4.282 2.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.343 on 380 degrees of freedom
## Multiple R-squared:  0.04603,    Adjusted R-squared:  0.04352 
## F-statistic: 18.33 on 1 and 380 DF,  p-value: 2.349e-05

It seems that the two data points have such high leverage as to bring their validity into question. Of course, in the absence of reasoned proof that they are invalid, they should be left in. In the interest of this exercise, I have nevertheless decided to apply the rule of thumb that observations with a Cook’s distance higher than 4/n (where n is the number of observations) can be removed. Let us see what the end result is after we apply this procedure to the data.
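
One way to implement this removal (a sketch under the assumption that both models are screened; the actual filtering code is not shown in this diary):

```r
# Drop observations whose Cook's distance exceeds 4/n in either model
n <- nrow(TheData)
cd_grade <- cooks.distance(lm(G3 ~ absences, data = TheData))
cd_abs   <- cooks.distance(lm(absences ~ alc_use, data = TheData))
TheData_2 <- TheData[cd_grade <= 4 / n & cd_abs <= 4 / n, ]
dim(TheData_2)
```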

## 
## Call:
## lm(formula = G3 ~ absences, data = TheData_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7846  -1.6377   0.2888   2.2888   6.9496 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.78457    0.22435  52.529   <2e-16 ***
## absences    -0.07342    0.03461  -2.121   0.0346 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.28 on 376 degrees of freedom
## Multiple R-squared:  0.01183,    Adjusted R-squared:  0.009198 
## F-statistic:   4.5 on 1 and 376 DF,  p-value: 0.03455
## 
## Call:
## lm(formula = absences ~ alc_use, data = TheData_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.885 -3.395 -1.397  1.603 41.595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4130     0.5349   4.511 8.63e-06 ***
## alc_use       0.9921     0.2533   3.917 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.791 on 376 degrees of freedom
## Multiple R-squared:  0.0392, Adjusted R-squared:  0.03664 
## F-statistic: 15.34 on 1 and 376 DF,  p-value: 0.0001066

The new dimensions:

## [1] 378  35

With the removal of just four observations with a Cook’s distance higher than 4/n, as the new dimensions testify, we can see that absences now function as a statistically significant predictor of academic performance. I would argue that, despite the absence of observation-specific reasons supporting the removal, the logical expectation that presence in class predicts performance, together with the magnitude of change in the statistical significance of the results, warrants removing these values. As such, moving onward, this analysis relies on the modified dataset.

Finally, to test these results against just the effects of alcohol use on the final grade, and the effects of absences given high alcohol use, we will conduct two more regression analyses:

## 
## Call:
## lm(formula = G3 ~ alc_use, data = TheData_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4094  -1.8989   0.1011   2.1011   6.3459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.3884     0.3645  33.984  < 2e-16 ***
## alc_use      -0.4895     0.1726  -2.836  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 376 degrees of freedom
## Multiple R-squared:  0.02094,    Adjusted R-squared:  0.01833 
## F-statistic: 8.041 on 1 and 376 DF,  p-value: 0.004819
## 
## Call:
## lm(formula = G3 ~ absences, data = filter(TheData_2, alc_use > 
##     3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8329  -0.7404   1.1671   1.8142   5.6294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.83286    0.93802  11.549 1.72e-13 ***
## absences    -0.04622    0.11495  -0.402     0.69    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.606 on 35 degrees of freedom
## Multiple R-squared:  0.004599,   Adjusted R-squared:  -0.02384 
## F-statistic: 0.1617 on 1 and 35 DF,  p-value: 0.69

Looking at the results of the regression analyses, we can see that despite all the work that went into removing the high-Cook’s-distance observations, alcohol use on its own is still a stronger and statistically more significant predictor of poorer academic performance than the number of absences. We also see that the number of absences, given high alcohol use, offers no better predictive power than absences alone. As such, contrary to the originally proposed mechanism, while alcohol use is a statistically significant and strong predictor of absences (one step up the alcohol use scale corresponds to almost one full additional day of absence), absences themselves do not function as a strong predictor of poorer academic performance. In fact, absences explain only approximately half of the variation in the final grade that alcohol use explains. We can thus conclude that while alcohol use predicts poorer performance, it does not appear to do so through absences.

C and D

As expected, both poor family relations and free time are statistically significant predictors of alcohol use. We can examine these in more detail with linear regression, as has been done below. As for the hypothesized impact of poor family relations given lots of free time, it has no larger effect than poor family relations alone. In fact, given ample free time, poor family relations seem to have a smaller effect, but this difference is not statistically significant:

## 
## Call:
## lm(formula = alc_use ~ freetime, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1822 -0.8364 -0.1822  0.5095  3.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.31755    0.16884   7.803 6.02e-14 ***
## freetime     0.17294    0.05015   3.449 0.000627 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9604 on 376 degrees of freedom
## Multiple R-squared:  0.03066,    Adjusted R-squared:  0.02808 
## F-statistic: 11.89 on 1 and 376 DF,  p-value: 0.0006273
## 
## Call:
## lm(formula = alc_use ~ famrel, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2930 -0.8469 -0.2576  0.4924  3.2778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.43567    0.22189  10.977  < 2e-16 ***
## famrel      -0.14269    0.05497  -2.596  0.00981 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9668 on 376 degrees of freedom
## Multiple R-squared:  0.0176, Adjusted R-squared:  0.01499 
## F-statistic: 6.738 on 1 and 376 DF,  p-value: 0.009808
## 
## Call:
## lm(formula = alc_use ~ famrel + freetime, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3850 -0.7974 -0.2116  0.5067  3.0067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.92585    0.25381   7.588  2.6e-13 ***
## famrel      -0.17339    0.05452  -3.180 0.001594 ** 
## freetime     0.19586    0.05007   3.912 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.949 on 375 degrees of freedom
## Multiple R-squared:  0.05612,    Adjusted R-squared:  0.05108 
## F-statistic: 11.15 on 2 and 375 DF,  p-value: 1.982e-05
## 
## Call:
## lm(formula = alc_use ~ famrel, data = filter(TheData_2, freetime > 
##     3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20823 -1.00998 -0.07606  0.85786  2.99002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.34040    0.45699   5.121 9.64e-07 ***
## famrel      -0.06608    0.11041  -0.599     0.55    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.102 on 143 degrees of freedom
## Multiple R-squared:  0.002499,   Adjusted R-squared:  -0.004477 
## F-statistic: 0.3582 on 1 and 143 DF,  p-value: 0.5504

As such, we can conclude section 4 by summarizing that while alcohol use has a negative effect on academic performance, and poor family relations and free time increase alcohol consumption, all of these effects, though statistically significant, are modest if we look at R-squared: alcohol use explains only approximately 2% of the variance in final grades, while poor family relations and free time, even taken together, explain only approximately 5% of the variance in alcohol consumption. As such, while we have some evidence of causal relationships, those relationships are not strong. We can additionally reject the hypothesis that the mechanism by which alcohol consumption affects grades is an increased number of absences.

5.
Logistic Regression of the above variables.

In the above analysis we have treated alcohol use either as an explanatory variable (A-B) or as a target variable (C-D) in a linear model. Here, alcohol use will be recoded as a binomial variable: individuals with an alcohol consumption higher than 2 will be labeled “alcoholics.” Since alc_use is the mean of two five-point scales, this means individuals with alc_use of 2.5 or higher belong to the category “alcoholics,” while the rest do not. Modeling the other above variables within this framework requires logistic regression, which calculates the probability of an individual belonging to a category (here, alcoholics) based on the model inputs. A probability higher than 0.5 will indicate belonging to the group.

We will employ all the other variables used above, including absences, since it did have a statistically significant relationship with alcohol use. Consequently, we obtain the following logistic regression:
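
The model could be fit as follows (a sketch; the code is not echoed in this diary, and the model name is my own):

```r
# Logistic regression: probability of belonging to the "alcoholics" group
model_alc <- glm(alcoholics ~ famrel + freetime + absences + G3,
                 data = TheData_2, family = "binomial")
summary(model_alc)
```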

## 
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences + G3, 
##     family = "binomial", data = TheData_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6726  -0.8342  -0.6456   1.1785   2.0220  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.56725    0.73233  -0.775 0.438585    
## famrel      -0.31613    0.12903  -2.450 0.014280 *  
## freetime     0.41441    0.12521   3.310 0.000934 ***
## absences     0.07175    0.02371   3.027 0.002473 ** 
## G3          -0.06959    0.03593  -1.937 0.052781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.91  on 377  degrees of freedom
## Residual deviance: 424.42  on 373  degrees of freedom
## AIC: 434.42
## 
## Number of Fisher Scoring iterations: 4

The Odds Ratios and Their Confidence Intervals

##                    OR     2.5 %    97.5 %
## (Intercept) 0.5670801 0.1325241 2.3671826
## famrel      0.7289627 0.5649509 0.9387214
## freetime    1.5134784 1.1886426 1.9440711
## absences    1.0743848 1.0268123 1.1277893
## G3          0.9327736 0.8689267 1.0008341
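The odds ratios and their confidence intervals are obtained by exponentiating the model coefficients and their profile-likelihood confidence intervals. A sketch, refitting a small model on simulated stand-in data since the student data is external:

```r
library(MASS)   # provides profile-likelihood confint() for glm objects

# Small logistic model on simulated stand-in data (hypothetical values)
set.seed(1)
d <- data.frame(y = rbinom(200, 1, 0.3), x1 = rnorm(200), x2 = rnorm(200))
model <- glm(y ~ x1 + x2, family = "binomial", data = d)

# Odds ratios: exponentiated coefficients
OR <- exp(coef(model))
# Exponentiated profile-likelihood 95% confidence intervals
CI <- exp(confint(model))
cbind(OR, CI)
```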

In the above summary we can see that the variables used span a wide range of statistical significance. Taking the commonly accepted cut-off of a p-value below 0.05, alongside absolute z-values higher than 2 and 95% confidence intervals for the odds ratios that do not include 1, we can conclude that the variable G3, the final grade, is not a statistically significant predictor in our model. As such, we can drop it going forward. (A confidence interval crossing 1 means that the 95% confidence interval contains the odds ratio 1, indicating no relationship between the predictor and the target variable.)

The odds ratios support our initial hypothesis. Since odds ratios higher than 1 indicate that the variable positively predicts the observation/individual belonging to the group (in this case, alcoholics), both free time and absences positively predict belonging to the alcoholics group.

Since better family relations negatively predict belonging to the alcoholics group, we can conclude that the hypothesized positive impact of bad family relations holds.

Nevertheless, as stated above, the impact of these variables is minor, with odds ratios falling close to 1 (even odds).

(As the final grade is not statistically significant, it has been ignored.)

6.
The below numerical and graphical explorations detail the accuracy of the model without the variable G3. While the plot would seem to indicate a rather random sorting of predictions, a closer examination, carried out by tabulating the predictions against the data, shows a more nuanced picture. The model clearly over-predicts non-alcoholics, and when it does predict an alcoholic, that prediction is right only about half of the time. Still, since the majority of cases are non-alcoholics, the model's training error is “only” 0.29, meaning that 29% of the predictions are incorrect. This is better than mere random guessing, or flipping a coin, especially since the alcoholics and non-alcoholics are not split 50/50. Nevertheless, we can see from both the graph and the confusion matrix that the model misses many, many cases where the individual does belong to the group “alcoholics.” As such, it is not a good model.

## 
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences, family = "binomial", 
##     data = TheData_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7039  -0.8148  -0.6661   1.2152   1.9476  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33999    0.61829  -2.167 0.030215 *  
## famrel      -0.32602    0.12889  -2.529 0.011424 *  
## freetime     0.41713    0.12437   3.354 0.000797 ***
## absences     0.07582    0.02361   3.211 0.001323 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.91  on 377  degrees of freedom
## Residual deviance: 428.17  on 374  degrees of freedom
## AIC: 436.17
## 
## Number of Fisher Scoring iterations: 4

##           prediction
## alcoholics FALSE TRUE
##      FALSE   255   13
##      TRUE     98   12
## [1] 0.2936508
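The confusion matrix and training error above can be computed along these lines (a sketch on simulated stand-in data, since the real student data is external):

```r
# Simulated stand-in data (hypothetical values)
set.seed(1)
d <- data.frame(
  alcoholics = rbinom(378, 1, 0.3) == 1,
  famrel     = sample(1:5, 378, replace = TRUE),
  freetime   = sample(1:5, 378, replace = TRUE),
  absences   = rpois(378, 4)
)
model <- glm(alcoholics ~ famrel + freetime + absences,
             family = "binomial", data = d)

# Predicted probabilities; a probability > 0.5 counts as "alcoholic"
d$probability <- predict(model, type = "response")
d$prediction  <- d$probability > 0.5

# Confusion matrix: actual classes vs. predictions
table(alcoholics = d$alcoholics, prediction = d$prediction)

# Training error: proportion of incorrect predictions
loss <- mean(d$alcoholics != d$prediction)
loss
```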

BONUS

By using ten-fold cross-validation, we can repeatedly train the model on nine-tenths of “TheData” and then check its accuracy (defined by the ratio of incorrect guesses, as above) against the remaining one-tenth. This is done below. The error of roughly 0.3 indicates that the model performs similarly on held-out data as it did when trained and evaluated on the whole data. It is worse than the one introduced in DataCamp.

## [1] 0.3042328
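The cross-validation error above can be computed with `cv.glm` from the boot package (a sketch, again on simulated stand-in data):

```r
library(boot)   # ships with R; provides cv.glm

# Simulated stand-in data (hypothetical values)
set.seed(1)
d <- data.frame(
  alcoholics = rbinom(378, 1, 0.3),
  famrel     = sample(1:5, 378, replace = TRUE),
  freetime   = sample(1:5, 378, replace = TRUE),
  absences   = rpois(378, 4)
)
model <- glm(alcoholics ~ famrel + freetime + absences,
             family = "binomial", data = d)

# Loss: proportion of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Ten-fold cross-validation; delta[1] is the CV prediction error
cv <- cv.glm(data = d, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
```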

I was able to find a better model after playing around with the above variables in combination with sex, failures, and goout. It has an error rate of 0.24 in ten-fold cross-validation:

## [1] 0.2433862

THE END!


Introduction to Open Data Science - Week 4, Linear Discriminant Analysis and K-Means

2. The “Boston”-dataset

The “Boston” dataset used in this analysis can be downloaded with the “MASS” package. As such, it can be seen as a training dataset of sorts. It contains 14 variables with a (potential) connection to housing values in the suburbs of Boston. These variables are:
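Loading the data is a one-liner, since MASS ships with every standard R installation:

```r
library(MASS)   # provides the Boston data
data("Boston")
dim(Boston)     # 506 rows, 14 columns
```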

Variable: Explanation
“crim”: per capita crime rate by town.
“zn”: proportion of residential land zoned for lots over 25,000 sq.ft.
“indus”: proportion of non-retail business acres per town.
“chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
“nox”: nitrogen oxides concentration (parts per 10 million).
“rm”: average number of rooms per dwelling.
“age”: proportion of owner-occupied units built prior to 1940.
“dis”: weighted mean of distances to five Boston employment centres.
“rad”: index of accessibility to radial highways.
“tax”: full-value property-tax rate per $10,000.
“ptratio”: pupil-teacher ratio by town.
“black”: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
“lstat”: lower status of the population (percent).
“medv”: median value of owner-occupied homes in $1000s.
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Each variable has 506 observations/points of data, and the computational check below indicates that the dataset contains no empty cells and that each of the 14x506=7084 values is numeric or integer. Some observations are ratios, some percentages, and at least one (chas) is a dummy variable coded 0/1.

# Count empty (NA) cells across the whole data frame
Empty <- sum(is.na(Boston))

# Count numeric/integer cells: test each column, multiply by the row count
Numer <- sum(sapply(Boston, is.numeric)) * nrow(Boston)

3. The Graphical Overview of the Data.

Below, the reader can find the simple bar graphs of each variable, as well as a summary of each examined variable:
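A base-R sketch of how such per-variable graphs and the summary can be produced (the report's own plots may have used ggplot2 instead):

```r
library(MASS)   # Boston data

# One histogram per variable, drawn on a 4 x 4 grid
par(mfrow = c(4, 4))
for (v in names(Boston)) {
  hist(Boston[[v]], main = v, xlab = "")
}

# Per-variable summary statistics
summary(Boston)
```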

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

As the above general overview indicates, the variables take values in very different ranges, as one would expect from a dataset containing many different measures. Commenting on all the distributions seems pointless at first sight, but most of the graphs do indicate something interesting.

To begin with, the age-graph indicates an aging city. More importantly, its automatic scale seems off. Either there is a high number of areas in the city where the proportion of buildings built before 1940 is close to 100, or the dataset has a typo. Or, as a final thought that seems the most likely: many of the properties surveyed for this dataset come from the same Boston town/area and hence share exactly the same values for some area-specific variables.

The black-graph appears empty. Yet again, a closer examination (carried out below) shows that the data takes some interesting values. I do not have the knowledge to say what the measure “1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town” should indicate, but approximately 120 of the properties seem to get a value close to 397, while other values occur only once. In fact, the summary statistic presented above shows that the repeated value is probably 396.90. The variable is also curious in that most of its observations fall within the range of 370 to 400, yet its smallest observation is 0.32. It might be that the small observation has not been multiplied by 1000 as per the formula, since doing so would bring it close to the expected range. Then again, this could be a sign of a typo, or of something else going on. And as above, the repetition of one value might be explained by many of the properties surveyed for this dataset coming from the same Boston town/area and hence sharing exactly the same values for some area-specific variables.

The crim-graph also looks empty, but again, the more detailed look below shows that most values range from 0 to 1, as one would expect from a per-capita rate. The fact that the general graph above has a range of 0 to 75 would indicate a typo or some placeholder value. Perhaps certain observations have been multiplied by a hundred to give a percentage, or the person recording the observation has forgotten a decimal point. This would seem to be the case based on the summary statistic, since the maximum values lie far above the median and mean.

The dis-graph looks empty as well, but the closer look below shows the granular level of the observations. With no aggregation, the single thin bars disappear from the graph when it is extended to contain all values. The summary statistic indicates nothing out of the ordinary.

The final empty-appearing graph, indus, seems to suffer from the same issue as the black-graph. As one can see below, most value-counts range between 0 and 10, but around the 17-mark there seems to be one value with a high count, approximately 120 observations. The same issue can also be observed in the tax-, rad-, and ptratio-graphs, although no detailed look is carried out below. This is further evidence that many of the properties surveyed for this dataset might come from the same Boston town/area and hence share exactly the same values for some area-specific variables.

Finally, the zn-graph looks odd as well, but I would argue that this is just the result of strict zoning laws limiting the number of large properties in most areas (observe the large count at the value 0).

As for the relationships between the variables, the below matrix shows the correlations of each variable paired with each of the others. Of note is the fact that the matrix seemingly indicates that almost every variable has some statistically significant relationship with each of the other variables. The only exception is the chas-variable, which is also the only dummy variable. An interesting question in this regard is why the chas-variable is the only one that does not have statistically significant correlations with most other variables. The first answer is simple: the fact that a tract bounds the river has no statistically significant impact on many of the other variables. The second option comes down to the inner workings of R: it might be that the cor.mtest-function used here to map p-values does not work well for dummy variables. No mention of this possibility is given by the ?cor.mtest command.

On the other hand, it is perhaps not surprising that variables that are expected to be significant predictors of housing prices also have statistically significant correlations with one another. Out of these correlations a few should be highlighted in preparation for the coming phases. The variable “crim” (per-capita crime rate by town) seems to have a strong, statistically significant positive correlation with high property-tax properties, as well as with properties with easier access to radial highways. Property crime is a good explanatory factor for these correlations: high tax- and rad-values indicate high-value targets (former) and/or easy getaway and access options (latter).

Higher levels of industry (indus), house age (age), air pollution (nox), pupil-to-teacher ratio (ptratio), and the population's lower status all have a weaker positive correlation with the crime rate. This perhaps indicates a second category of neighborhoods compared to the above: older, impoverished industrial areas with less access to good education.

Both higher median value and longer distance from employment centers correlate weakly and negatively with a higher crime rate. I have a hard time explaining this. Perhaps it is due to the existence of middle-class suburbs, which are not attractive to property theft due to their distance from a poor city center? This conclusion is perhaps supported by the strong negative correlation between the dis-variable on one hand and the indus-, nox-, and age-variables on the other, which would seem to indicate that the (employment) centers of the city are older industrial neighborhoods. All of this is of course speculative in the absence of clearer information.

Finally, the black-variable seems to be weakly negatively correlated with a higher crime rate, but as I do not understand the calculations behind the variable, it is rather hard to interpret the (potential) meaning of the correlation. As such I will drop it going forward.

4. Standardization and Categorization

Below the reader can find a summary of the standardized Boston variables. All of them now share a mean of zero, which is by definition a feature of a standardized variable. They are also all on the same scale, which means that they can be compared to one another more easily, although that is not immediately obvious from the data, since the value distributions still retain their curious aspects: for example, with the variable “black,” the minimum still lies far to the left of the rest of the data. Additionally, the standardized binomial variable “chas” has arguably become nonsensical: the old value of 0 has been replaced by -0.2723 and the old value of 1 by 3.6648.

It should also be noted that none of the variables can be fully standardized into standard normal distributions, since they do not adhere to a normal distribution to begin with. This is, at least in part, probably due to the (theorized) overrepresentation of one neighborhood in the dataset.
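The standardization itself is done with `scale()`, which subtracts each column's mean and divides by its standard deviation (a sketch):

```r
library(MASS)   # Boston data

# scale() centers each column to mean 0 and scales to standard deviation 1
boston_scaled <- as.data.frame(scale(Boston))

summary(boston_scaled$crim)
round(colMeans(boston_scaled), 10)   # all (numerically) zero
```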

##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

The reader can also note that in this second set, the crim-variable has been replaced by the Crime factor-variable, as per the instructions, and the chas-variable has been returned to its original binomial state. Even further down, the reader can find the test set with the Crime variable removed, after the correct answers have been saved.
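The Crime factor is created by cutting the standardized crim variable at its quantiles, shown in the output below (a sketch):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# Quantile break points of the standardized crime rate
bins <- quantile(boston_scaled$crim)

# Cut into four ordered categories, one per inter-quantile interval
Crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("Lowest", "Lower", "Higher", "Highest"))
table(Crime)
```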

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
##        zn               indus              nox                rm         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-1.4644   Min.   :-3.8764  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.9121   1st Qu.:-0.5681  
##  Median :-0.48724   Median :-0.2109   Median :-0.1441   Median :-0.1084  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.: 0.5981   3rd Qu.: 0.4823  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 2.7296   Max.   : 3.5515  
##       age               dis               rad               tax         
##  Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127  
##  1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668  
##  Median : 0.3171   Median :-0.2790   Median :-0.5225   Median :-0.4642  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294  
##  Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964  
##     ptratio            black             lstat              medv        
##  Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.2746   Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  
##      Crime      Boston.chas     
##  Lowest :127   Min.   :0.00000  
##  Lower  :126   1st Qu.:0.00000  
##  Higher :126   Median :0.00000  
##  Highest:127   Mean   :0.06917  
##                3rd Qu.:0.00000  
##                Max.   :1.00000
##        zn               indus               nox                 rm         
##  Min.   :-0.48724   Min.   :-1.42220   Min.   :-1.46443   Min.   :-3.8764  
##  1st Qu.:-0.48724   1st Qu.:-0.74002   1st Qu.:-0.77189   1st Qu.:-0.7769  
##  Median :-0.48724   Median : 0.01796   Median : 0.04578   Median :-0.2279  
##  Mean   :-0.09483   Mean   : 0.18062   Mean   : 0.14941   Mean   :-0.1578  
##  3rd Qu.:-0.48724   3rd Qu.: 1.01499   3rd Qu.: 0.87424   3rd Qu.: 0.3033  
##  Max.   : 3.58609   Max.   : 2.42017   Max.   : 2.72965   Max.   : 3.0078  
##       age               dis               rad                tax          
##  Min.   :-2.2017   Min.   :-1.2623   Min.   :-0.98187   Min.   :-1.30676  
##  1st Qu.:-0.3810   1st Qu.:-0.8736   1st Qu.:-0.63733   1st Qu.:-0.76385  
##  Median : 0.5284   Median :-0.5047   Median :-0.52248   Median :-0.03107  
##  Mean   : 0.1834   Mean   :-0.1425   Mean   : 0.04499   Mean   : 0.11825  
##  3rd Qu.: 0.9707   3rd Qu.: 0.4605   3rd Qu.: 1.65960   3rd Qu.: 1.52941  
##  Max.   : 1.1164   Max.   : 2.5777   Max.   : 1.65960   Max.   : 1.79642  
##     ptratio               black              lstat               medv         
##  Min.   :-2.5199404   Min.   :-3.52291   Min.   :-1.36857   Min.   :-1.90634  
##  1st Qu.:-0.7185093   1st Qu.: 0.18324   1st Qu.:-0.67470   1st Qu.:-0.72934  
##  Median : 0.3207778   Median : 0.37670   Median : 0.05138   Median :-0.25908  
##  Mean   :-0.0002918   Mean   : 0.06868   Mean   : 0.20566   Mean   :-0.05186  
##  3rd Qu.: 0.8057784   3rd Qu.: 0.43832   3rd Qu.: 0.76346   3rd Qu.: 0.26826  
##  Max.   : 1.2676838   Max.   : 0.44062   Max.   : 3.54526   Max.   : 2.98650  
##   Boston.chas     
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.06863  
##  3rd Qu.:0.00000  
##  Max.   :1.00000

5. Linear Discriminant Analysis (LDA)

Despite the fact that none of the variables adhere to the assumption of normal distribution required by LDA, nor is the chas-variable continuous as is usually expected, below the reader can find the required LDA (bi)plot. It contains the categorical crime rate as the target variable and all of the remaining variables as predictors (even the black-variable, despite what I said earlier about not using it). Observing both the biplot and the LDA output, we can see that LD1 explains 96 percent of the between-group variance, while LD2 explains three percent and LD3 one percent.
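A sketch of the fit: an 80% training split and `lda()` from MASS. The seed and row sampling here are hypothetical, and unlike the report (which restored chas to 0/1) all predictors are left standardized, so the numbers will differ from the output below:

```r
library(MASS)
set.seed(123)   # hypothetical seed for this sketch

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$Crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("Lowest", "Lower", "Higher", "Highest"))
boston_scaled$crim <- NULL   # the target replaces the raw crime rate

# 80% of the rows go into the training set
ind      <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
TrainSet <- boston_scaled[ind, ]
TestSet  <- boston_scaled[-ind, ]

lda_fit <- lda(Crime ~ ., data = TrainSet)
lda_fit$svd^2 / sum(lda_fit$svd^2)   # proportion of trace (LD1, LD2, LD3)
```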

## Call:
## lda(Crime ~ ., data = TrainSet)
## 
## Prior probabilities of groups:
##    Lowest     Lower    Higher   Highest 
## 0.2623762 0.2500000 0.2450495 0.2425743 
## 
## Group means:
##                  zn       indus        nox          rm        age        dis
## Lowest   0.94873938 -0.89313923 -0.8545752  0.42801907 -0.8755807  0.8451251
## Lower   -0.06165277 -0.31403667 -0.5964211 -0.08767888 -0.3996310  0.3987432
## Higher  -0.37290116  0.08583784  0.3417213  0.12938920  0.3487895 -0.3356536
## Highest -0.48724019  1.01499462  1.0382944 -0.33907005  0.8156924 -0.8377148
##                rad        tax     ptratio       black       lstat        medv
## Lowest  -0.6752522 -0.7315526 -0.42960060  0.38432721 -0.77989661  0.50243720
## Lower   -0.5361295 -0.4971113 -0.04760321  0.31681294 -0.19061341  0.01925548
## Higher  -0.4192384 -0.3453652 -0.28879737  0.09365932 -0.01323016  0.17973617
## Highest  1.6596029  1.5294129  0.80577843 -0.90831104  0.83932500 -0.69089518
##         Boston.chas
## Lowest   0.03773585
## Lower    0.06930693
## Higher   0.12121212
## Highest  0.05102041
## 
## Coefficients of linear discriminants:
##                     LD1          LD2         LD3
## zn           0.08866773  0.718788171 -0.94589532
## indus        0.05331710 -0.132123462  0.44938777
## nox          0.17161192 -0.906743857 -1.38261185
## rm          -0.11183753 -0.098954475 -0.09108170
## age          0.26550601 -0.291632819 -0.14216674
## dis         -0.09554650 -0.350826244  0.26710102
## rad          3.62497052  0.984486539 -0.03752939
## tax          0.14286346 -0.098355568  0.44636102
## ptratio      0.11231641  0.040946150 -0.24712381
## black       -0.12587821  0.002389311  0.11140499
## lstat        0.17795303 -0.298690391  0.44507952
## medv         0.18192519 -0.448381368 -0.16503517
## Boston.chas -0.38750819 -0.261604065  0.38667135
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9601 0.0295 0.0104

6. LDA for Prediction

The table below shows the cross-tabulated predictions against the actual categories. We can see that the model rather accurately predicts the group for properties in the higher and highest crime rate areas, while it struggles a bit more in the lower and lowest categories. Indeed, the model seems to slightly over-predict higher crime rates, especially when provided with data on a property in a lower/lowest crime rate area. Nevertheless, it outperforms simple guessing, under which a property would have an almost equal 25% chance of belonging to any of the four categories, as indicated by the prior probabilities in the previous section's model output. A simple random division of the properties into four equal-sized groups would result, on average, in three incorrect predictions per correct prediction; such odds are much worse than the model's odds of correctly predicting a property belonging to a lowest crime rate area.

##          predicted
## correct   Lowest Lower Higher Highest
##   Lowest      14     7      0       0
##   Lower        4    16      5       0
##   Higher       0     8     17       2
##   Highest      0     0      1      28
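The cross-tabulation is produced by predicting on the held-out test set. A self-contained sketch (it repeats a hypothetical split, so the exact counts will differ from the table above):

```r
library(MASS)
set.seed(123)   # hypothetical seed

bs <- as.data.frame(scale(Boston))
bs$Crime <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE,
                labels = c("Lowest", "Lower", "Higher", "Highest"))
bs$crim <- NULL
ind      <- sample(nrow(bs), size = floor(0.8 * nrow(bs)))
TrainSet <- bs[ind, ]
TestSet  <- bs[-ind, ]

lda_fit <- lda(Crime ~ ., data = TrainSet)

# Predict the class of each test-set property and cross-tabulate
pred <- predict(lda_fit, newdata = TestSet)
table(correct = TestSet$Crime, predicted = pred$class)
```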

7. K-Means Analysis

The below two graphs showcase the results of the final K-means analysis. The first details the change in the total within-cluster sum of squares (WCSS) as the number of clusters is increased from 1 to 14. The aim is to use the graph to find the optimal number of clusters. Since a more granular level will always lead to a smaller WCSS without necessarily being a better grouping device (consider that the smallest WCSS comes from having only one observation in each “cluster,” meaning that no clustering has been done), we need to find a point where the WCSS drops drastically, indicating a number of clusters that is significantly more precise than a smaller number, but not significantly less precise than a larger one. The first graph indicates that that point is two (2) clusters.
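The WCSS curve and the final clustering can be computed like this (a sketch; the seed is illustrative, so individual WCSS values will vary):

```r
library(MASS)
set.seed(123)   # k-means starts from random centers
bs <- scale(Boston)

# Total within-cluster sum of squares for k = 1 ... 14
twcss <- sapply(1:14, function(k) kmeans(bs, centers = k)$tot.withinss)

# The elbow suggests two clusters; fit the final model
km <- kmeans(bs, centers = 2)
table(km$cluster)
```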

As for the pairs plot produced by clustering the data into two clusters, we will only discuss the top row/first column of the graphs, which relate to the crime rate. This is done to limit the discussion to the relevant aspects rather than covering each of the 182 panels. What we need to keep in mind is that a K-means analysis clustering into two groups attempts to find two sets of data where the total (in this case Euclidean) distances to the group mean are the smallest. If we had a single group sharing many of its observation values, we would expect that group to repeat itself in each graph. And indeed, most of the crime-graphs maintain a very similar, flat/narrow red-group structure throughout the pairings. To me, this is further evidence that the roughly 120 uncommonly consistently-valued observations that section 2 identified in multiple variables come from a single group of properties from the same area. Perhaps the data showcase something else as well, but hopefully this will suffice. This is already a long text.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.2663  4.6116  4.7275  5.9572 13.8843


Introduction to Open Data Science - Week 5, Dimension Reduction Techniques

1. Data Exploration

This week’s data analysis mostly deals with the “human” dataset. The original data contained a number of variables mapping human development and gender equality in 195 countries; it was then cut down to the following eight variables:
+ Life.Expectancy refers to a citizen's average life expectancy at birth
+ Education.Years is the expected number of years a citizen will spend in education
+ GNI.Per.Capita is the Gross National Income adjusted for population
+ Maternal.Mortality is the ratio of maternal deaths per 100,000 births
+ Teen.Birth is the number of births per 1,000 women aged 15 to 19
+ Parliamentary.Participation is the ratio of women to men in parliament
+ Female.Secondary.Education is the percentage of females that attend secondary education
+ Female.Work is the percentage of females that participate in the labour force

Below are simple bar plots graphically showcasing the variables used this week. As the graphs show, most of these variables do not follow a Gaussian distribution, i.e. they are not bell curves. The only exception is the variable mapping the expected years a citizen will spend in school.

Even further down, the reader can find the correlation plot, where "*" indicates statistical significance at the 0.05 level. As can be seen, three sets of variables emerge from the data:
+ Female.Work and Parliamentary.Participation, which are not strongly correlated with anything
+ Maternal.Mortality and Teen.Birth, which are strongly positively correlated only with one another.
+ The remaining variables on education, GNI, and life-expectancy, which are (semi) strongly positively correlated with each other and strongly negatively correlated with the preceding set of variables.
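The groupings above can be read off a plain correlation matrix and pairwise tests, computed with base R's `cor()` and `cor.test()` (a sketch on simulated stand-in variables, since the real data is external):

```r
# Stand-in data: b is built to correlate strongly with a, c is independent
set.seed(1)
d <- data.frame(a = rnorm(100))
d$b <- d$a + rnorm(100, sd = 0.5)   # strongly positively correlated with a
d$c <- rnorm(100)                    # uncorrelated with both

# Pairwise Pearson correlations
round(cor(d), 2)

# p-value for a single pair, the kind of test behind the "*" markers
cor.test(d$a, d$b)$p.value
```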

##  Female.Secondary.Education  Female.Work    Life.Expectancy Education.Years
##  Min.   :  0.90             Min.   :13.50   Min.   :49.00   Min.   : 5.40  
##  1st Qu.: 27.15             1st Qu.:44.45   1st Qu.:66.30   1st Qu.:11.25  
##  Median : 56.60             Median :53.50   Median :74.20   Median :13.50  
##  Mean   : 55.37             Mean   :52.52   Mean   :71.65   Mean   :13.18  
##  3rd Qu.: 85.15             3rd Qu.:61.90   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :100.00             Max.   :88.10   Max.   :83.50   Max.   :20.20  
##  GNI.per.Capita   Maternal.Mortality   Teen.Birth    
##  Min.   :   581   Min.   :   1.0     Min.   :  0.60  
##  1st Qu.:  4198   1st Qu.:  11.5     1st Qu.: 12.65  
##  Median : 12040   Median :  49.0     Median : 33.60  
##  Mean   : 17628   Mean   : 149.1     Mean   : 47.16  
##  3rd Qu.: 24512   3rd Qu.: 190.0     3rd Qu.: 71.95  
##  Max.   :123124   Max.   :1100.0     Max.   :204.80  
##  Parliamentary.Participation
##  Min.   : 0.00              
##  1st Qu.:12.40              
##  Median :19.30              
##  Mean   :20.91              
##  3rd Qu.:27.95              
##  Max.   :57.50

2. Principal Component Analysis on Non-Standardized Data

Below, Principal Component Analysis, relying on Singular Value Decomposition, is applied to the non-standardized “human” data explored above. This analysis is expected to turn out somewhat skewed, as PCA relies on variance to identify the principal components. Consequently, variables measured on larger scales will have an unreasonably high impact on the analysis.
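The analysis uses `prcomp`, which performs PCA via singular value decomposition; without scaling, the large-variance GNI variable dominates. A sketch on simulated stand-in data (the real “human” data is external, and only three of its variables are imitated here):

```r
set.seed(1)
# Hypothetical stand-in: GNI is on a vastly larger scale than the others
human_like <- data.frame(
  GNI.per.Capita  = rlnorm(155, meanlog = 9, sdlog = 1),
  Life.Expectancy = rnorm(155, 72, 8),
  Education.Years = rnorm(155, 13, 3)
)

pca <- prcomp(human_like)   # center = TRUE, scale. = FALSE by default
summary(pca)

# PC1 is dominated by the variable with the largest variance
round(pca$rotation[, "PC1"], 3)
```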

## Standard deviations (1, .., p=8):
## [1] 18544.172057   186.283543    25.972416    20.074641    14.321634
## [6]    10.634338     3.721257     1.428190
## 
## Rotation (n x k) = (8 x 8):
##                                       PC1          PC2           PC3
## Female.Secondary.Education  -9.317458e-04  0.085169068 -0.3652346210
## Female.Work                  9.124960e-05 -0.031442122 -0.0180958993
## Life.Expectancy             -2.815825e-04  0.028224407 -0.0170864406
## Education.Years             -9.562916e-05  0.007556395 -0.0210421918
## GNI.per.Capita              -9.999828e-01 -0.005830843  0.0006412388
## Maternal.Mortality           5.655739e-03 -0.987500960 -0.1481355205
## Teen.Birth                   1.233962e-03 -0.125305410  0.9184673154
## Parliamentary.Participation -5.526462e-05  0.003211742 -0.0038388487
##                                       PC4           PC5           PC6
## Female.Secondary.Education  -0.8435797499 -3.770012e-01 -6.083913e-02
## Female.Work                 -0.3928157700  8.233860e-01  4.077784e-01
## Life.Expectancy             -0.0212996883  1.136966e-02 -6.669649e-02
## Education.Years             -0.0308785262  3.896982e-03 -2.437473e-02
## GNI.per.Capita               0.0002381021  6.651486e-06  2.062449e-05
## Maternal.Mortality          -0.0173448186 -3.955532e-02 -2.010447e-02
## Teen.Birth                  -0.3475453954 -1.381061e-01 -2.499459e-02
## Parliamentary.Participation -0.1075784134  3.989026e-01 -9.077136e-01
##                                       PC7           PC8
## Female.Secondary.Education   0.0303039622  3.118655e-02
## Female.Work                 -0.0093667016  6.654689e-03
## Life.Expectancy             -0.9901494274  1.161211e-01
## Education.Years             -0.1134676761 -9.925031e-01
## GNI.per.Capita               0.0001028639  1.871381e-05
## Maternal.Mortality          -0.0244043639 -7.101321e-04
## Teen.Birth                  -0.0127951595 -8.079546e-03
## Parliamentary.Participation  0.0704544742  1.925677e-02
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6   PC7   PC8
## Standard deviation     1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721 1.428
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00  0.00  0.00 0.000 0.000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00  1.00  1.00 1.000 1.000

As expected, the graph can tell us very little, and PC1 purports to explain (essentially) all of the variation. Let us standardize the data and try again.

3. Principal Component Analysis on Standardized Data

By standardizing the variables in the “Human”-data, the PCA is able to work on variables with comparable variances. Accordingly, variables on scales where high values are expected will not have a disproportionately high impact on the calculations. As can be seen below, the biplot produced by the standardized data can offer valuable insights into the data.
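The standardization and the rerun of the analysis might look like this (again a sketch with `human` as an assumed name; calling `prcomp(human, scale. = TRUE)` directly would be equivalent):

```r
# Standardize each variable to mean 0 and standard deviation 1,
# then repeat the PCA (sketch; `human` is an assumed name).
human_std <- scale(human)
pca_std <- prcomp(human_std)

summary(pca_std)
biplot(pca_std, choices = 1:2, cex = c(0.6, 0.8))
```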

## Standard deviations (1, .., p=8):
## [1] 2.1359720 1.1170602 0.8767066 0.7202736 0.5530904 0.5299143 0.4493522
## [8] 0.3372776
## 
## Rotation (n x k) = (8 x 8):
##                                     PC1         PC2          PC3         PC4
## Female.Secondary.Education  -0.39874523 0.101737848 -0.210481695 -0.31033900
## Female.Work                  0.14439668 0.682158952 -0.575062704 -0.27300332
## Life.Expectancy             -0.43080653 0.003997077  0.073302641 -0.02783772
## Education.Years             -0.41858363 0.138149663 -0.073869337 -0.07719277
## GNI.per.Capita              -0.33914119 0.098797891 -0.338769060  0.84020987
## Maternal.Mortality           0.41991628 0.123208094 -0.145471957  0.28520785
## Teen.Birth                   0.40271826 0.088902996 -0.003213286  0.11830579
## Parliamentary.Participation -0.07626861 0.687286162  0.691544283  0.14537131
##                                     PC5         PC6         PC7         PC8
## Female.Secondary.Education   0.43437742 -0.50478735  0.48033587  0.12578655
## Female.Work                 -0.22413576  0.23657515  0.01076780 -0.04754207
## Life.Expectancy              0.04920168  0.57970641  0.04122211  0.68415054
## Education.Years              0.30831448 -0.11017948 -0.81713841 -0.13919229
## GNI.per.Capita              -0.01249472  0.01564322  0.16310711 -0.16583233
## Maternal.Mortality           0.05493343 -0.43780233 -0.24286926  0.67254029
## Teen.Birth                   0.81248359  0.36928738  0.07464930 -0.11761127
## Parliamentary.Participation -0.01724555 -0.11284562  0.09265604 -0.02894539
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     2.1360 1.1171 0.87671 0.72027 0.55309 0.5299 0.44935
## Proportion of Variance 0.5703 0.1560 0.09608 0.06485 0.03824 0.0351 0.02524
## Cumulative Proportion  0.5703 0.7263 0.82235 0.88720 0.92544 0.9605 0.98578
##                            PC8
## Standard deviation     0.33728
## Proportion of Variance 0.01422
## Cumulative Proportion  1.00000

As the caption suggests, the data seems to indicate that while the participation of women in the labour force and in political decision-making are important components of gender equality, they do not predict the overall (physical) well-being of women. More on this below. The reason the standardized data provides more insight lies in its comparable means and variances. PCA creates its first principal component (PC) by finding the direction through the observations that accounts for the most variance (while constraining the sum of squares of the coefficients to 1). The second PC is calculated similarly from the remaining variance, but in such a way that it is uncorrelated with the first PC. With standardized data, the variances are not affected by the scales of the variables and are therefore comparable - no single variable can account for most of the variance simply because it takes large values. Conversely, with the non-standardized data, the variable measuring per-capita GNI had the largest explanatory power, since its values run into the tens of thousands. The remaining variables take much lower values (the second-highest absolute values belong to maternal mortality, and even that only ranges from roughly 1 to 1,100), and consequently they seemingly account for less of the variation in the data. This is why PC1 correlates almost perfectly with GNI.per.Capita in the biplot of the non-standardized variables, and why that variable appears to have a disproportionately high variance in the dataset.

4. Interpreting PC1 and PC2

As indicated above, high labour-force and political participation of women are not shorthand for a high level of well-being among women. We can see that the data is principally divided between countries with high levels of maternal mortality and adolescent births, and countries with high levels of life expectancy, education, and GNI. We might say that this principal component mostly maps reproductive health. Indeed, the former two variables are highly correlated with each other, which is not a surprise: giving birth at a young (teen) age carries a greater risk of death in childbirth because the mother’s body is not yet fully developed. The latter four variables, on the other hand, predict lower levels of maternal mortality and adolescent births, partly because higher levels of education are known to predict lower fertility rates and delayed childbearing among women, and partly because higher GNI predicts better health care, both contributing to a higher life expectancy.

Labour and political participation, on the other hand, presuppose neither of these sets of variables - they are almost completely uncorrelated with them. This indicates that women participate in the labour force and in parliament even in poorer countries with more traditional conceptions of child-rearing and worse health care. Perhaps it also indicates that most countries perform suboptimally when it comes to the participation of women in politics and the labour force. As such, these two variables are poor predictors of development, whether in terms of income, education, or health.

5. Multiple Correspondence Analysis on the Tea Data

The tea dataset, according to the information provided by ?tea, includes three sets of variables: the first 18 relate to tea-drinking habits, the next 5 map personal details (with age featuring twice, once as an integer and once as a categorical variable with five categories), and the final 13 map product perception. For this section I have chosen to divide the variables into two sets: the active habit variables fed into the MCA, and the supplementary personal-detail variables, which will be added as colours to the plot to see if we can find links between these supplementary variables and the two dimensions of variability showcased in the plot. The product-perception variables will be excluded from this analysis. Furthermore, only the categorical age variable will be included.
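The variable selection and the MCA call (echoed in the output further below) could be set up along these lines - a sketch in which the selection order is chosen so that the supplementary personal-detail variables land in columns 19-22:

```r
library(FactoMineR)
library(dplyr)

data(tea)  # the tea dataset ships with FactoMineR

# Active variables: the 18 habit columns plus drinking frequency;
# supplementary: sex, SPC, Sport and the categorical age variable.
Tea_Who_And_How <- select(tea, breakfast:price, sex, SPC, Sport, age_Q,
                          frequency)

# Columns 19-22 hold the supplementary (personal-detail) variables.
mca_tea <- MCA(Tea_Who_And_How, quali.sup = 19:22, graph = FALSE)
```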

Below, simple bar graphs have been created alongside the structure and dimensions of the remaining data. The reader is welcome to explore the graphs, but since their interpretation is not a requested part of this analysis, a few short comments should suffice:
- There is a clear distinction between places where tea is drunk (home, friends) and where and when it is not (pub, restaurant, dinner, lunch, work).
- People seem to drink Earl Grey frequently.
- Slightly more women answered the survey.
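The bar graphs might be drawn roughly as follows (a sketch, assuming the selected data is in `Tea_Who_And_How`; the factors are converted to character first so that columns with different levels can be combined into one long column):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# One bar chart per variable, faceted into a single figure.
Tea_Who_And_How %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "answer") %>%
  ggplot(aes(answer)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
```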

## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Moving on to the actual Multiple Correspondence Analysis, the reader can first find below the variable biplot of the MCA, where variables in red denote active variables and variables in green supplementary variables. It is worth noting first that the two dimensions identified by the MCA together account for only 16.7% of the principal inertia, meaning that most of the variation in the data is left unaccounted for. In fact, the summary statistic shows that increasing the number of dimensions has little effect on this, with six dimensions accounting for only 36.8% of the principal inertia.
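The biplot and summary referenced here could be produced along these lines (a sketch; `mca_tea` is an assumed object name for the fitted MCA):

```r
# Category biplot: individuals hidden, so only the active (red)
# and supplementary (green) categories are drawn.
plot(mca_tea, invisible = "ind")

summary(mca_tea)  # eigenvalues, contributions, v-tests
```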

Be that as it may, we will continue to work with these results. The theory I offer based on the below information is that dimension 1 maps what could be generalised as “tea consumption.” We see the frequency of tea drinking increase as we move from left to right. Additionally, the variables on the right-hand side of the plot indicate higher levels of tea consumption - to the point that tearooms are visited. The left-hand side showcases answers indicating little interest in tea: the tea is cheap, bought from a chain store, and the respondents do not generally choose tea when in restaurants, at home, or at friends’. Dimension 2, on the other hand, maps “tea enjoyment,” or something similar. The higher values correspond with upscale tea bought loose from a teashop, while the lower values reflect a more basic approach to tea. But as said, even if correct, these dimensions and their interpretations do not explain much of the variance in the data.

Consequently, it is not surprising that when we colour the plots to map the respondents by their membership in the categories of the four supplementary variables, we have a hard time identifying any significant linkages. Age does not seem to be linked to either of the identified dimensions. With sports, there is some indication that individuals who do not take part in sport are also lesser tea drinkers, but the evidence for this is weak. Similarly, among the employment-status categories, students and seniors seem to drink more tea. Perhaps the clearest linkage is between gender and dimension 1, where women are more prominent on the right-hand side of the scale, perhaps indicating that women tend to be bigger casual tea drinkers, with men more often opting for other drinks - or for finer teas, as indicated by the higher share of men at the top of dimension 2.

Overall, these plots are not very helpful - primarily because of their low ability to account for variation in the data, and also, perhaps consequently, because of the absence of other linkages they might have identified.

## 
## Call:
## MCA(X = Tea_Who_And_How, quanti.sup = NULL, quali.sup = 19:22,  
##      graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.148   0.116   0.094   0.078   0.073   0.072   0.067
## % of var.              9.392   7.335   5.972   4.957   4.635   4.534   4.233
## Cumulative % of var.   9.392  16.727  22.699  27.655  32.291  36.824  41.057
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.065   0.060   0.059   0.058   0.055   0.052   0.050
## % of var.              4.106   3.829   3.719   3.672   3.500   3.319   3.193
## Cumulative % of var.  45.163  48.992  52.711  56.383  59.883  63.202  66.395
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.049   0.046   0.045   0.044   0.039   0.037   0.035
## % of var.              3.094   2.932   2.841   2.785   2.480   2.330   2.210
## Cumulative % of var.  69.488  72.421  75.262  78.047  80.527  82.857  85.067
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.034   0.034   0.029   0.028   0.027   0.026   0.024
## % of var.              2.149   2.132   1.860   1.755   1.690   1.637   1.529
## Cumulative % of var.  87.216  89.348  91.208  92.963  94.653  96.290  97.820
##                       Dim.29  Dim.30
## Variance               0.019   0.016
## % of var.              1.196   0.984
## Cumulative % of var.  99.016 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.573  0.739  0.160 | -0.182  0.095  0.016 | -0.348  0.427
## 2             | -0.381  0.326  0.139 | -0.127  0.047  0.016 | -0.604  1.288
## 3             |  0.095  0.020  0.006 | -0.143  0.059  0.013 |  0.303  0.325
## 4             | -0.634  0.904  0.281 |  0.009  0.000  0.000 |  0.141  0.070
## 5             | -0.138  0.043  0.023 | -0.100  0.029  0.012 | -0.105  0.039
## 6             | -0.734  1.212  0.265 |  0.018  0.001  0.000 | -0.027  0.003
## 7             | -0.084  0.016  0.008 | -0.158  0.072  0.027 | -0.081  0.023
## 8             | -0.262  0.154  0.054 | -0.022  0.001  0.000 | -0.032  0.004
## 9             |  0.198  0.088  0.033 |  0.179  0.092  0.027 | -0.593  1.244
## 10            |  0.344  0.265  0.081 |  0.436  0.546  0.130 | -0.461  0.752
##                 cos2  
## 1              0.059 |
## 2              0.350 |
## 3              0.058 |
## 4              0.014 |
## 5              0.013 |
## 6              0.000 |
## 7              0.007 |
## 8              0.001 |
## 9              0.295 |
## 10             0.146 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |   0.231   0.908   0.049   3.835 |  -0.221   1.062   0.045
## Not.breakfast |  -0.213   0.838   0.049  -3.835 |   0.204   0.980   0.045
## Not.tea time  |  -0.483   3.622   0.181  -7.360 |   0.116   0.269   0.011
## tea time      |   0.375   2.808   0.181   7.360 |  -0.090   0.209   0.011
## evening       |   0.331   1.332   0.057   4.133 |  -0.013   0.003   0.000
## Not.evening   |  -0.173   0.696   0.057  -4.133 |   0.007   0.001   0.000
## lunch         |   0.714   2.656   0.088   5.121 |  -0.352   0.824   0.021
## Not.lunch     |  -0.123   0.457   0.088  -5.121 |   0.060   0.142   0.021
## dinner        |  -0.795   1.568   0.048  -3.769 |   0.827   2.174   0.051
## Not.dinner    |   0.060   0.118   0.048   3.769 |  -0.062   0.164   0.051
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast      -3.665 |  -0.621  10.329   0.356 -10.315 |
## Not.breakfast   3.665 |   0.573   9.535   0.356  10.315 |
## Not.tea time    1.773 |   0.158   0.607   0.019   2.403 |
## tea time       -1.773 |  -0.122   0.471   0.019  -2.403 |
## evening        -0.169 |   0.510   4.994   0.136   6.383 |
## Not.evening     0.169 |  -0.267   2.611   0.136  -6.383 |
## lunch          -2.520 |   0.356   1.037   0.022   2.552 |
## Not.lunch       2.520 |  -0.061   0.178   0.022  -2.552 |
## dinner          3.922 |   0.637   1.586   0.031   3.023 |
## Not.dinner     -3.922 |  -0.048   0.119   0.031  -3.023 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.049 0.045 0.356 |
## tea.time      | 0.181 0.011 0.019 |
## evening       | 0.057 0.000 0.136 |
## lunch         | 0.088 0.021 0.022 |
## dinner        | 0.048 0.051 0.031 |
## always        | 0.060 0.003 0.053 |
## home          | 0.013 0.000 0.131 |
## work          | 0.099 0.049 0.002 |
## tearoom       | 0.344 0.017 0.000 |
## friends       | 0.208 0.011 0.160 |
## 
## Supplementary categories (the 10 first)
##                  Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3   cos2
## F             |  0.168  0.041  3.511 | -0.114  0.019 -2.375 | -0.045  0.003
## M             | -0.245  0.041 -3.511 |  0.166  0.019  2.375 |  0.066  0.003
## employee      | -0.153  0.006 -1.305 | -0.145  0.005 -1.243 |  0.064  0.001
## middle        | -0.010  0.000 -0.071 |  0.312  0.015  2.114 | -0.330  0.017
## non-worker    |  0.005  0.000  0.045 |  0.175  0.008  1.580 | -0.254  0.017
## other worker  | -0.030  0.000 -0.140 |  0.009  0.000  0.041 |  0.063  0.000
## senior        |  0.370  0.018  2.325 |  0.047  0.000  0.297 | -0.106  0.001
## student       |  0.031  0.000  0.299 | -0.279  0.024 -2.664 |  0.391  0.047
## workman       | -0.454  0.009 -1.601 |  0.216  0.002  0.764 |  0.061  0.000
## Not.sportsman | -0.031  0.001 -0.444 |  0.016  0.000  0.222 | -0.036  0.001
##               v.test  
## F             -0.942 |
## M              0.942 |
## employee       0.545 |
## middle        -2.238 |
## non-worker    -2.285 |
## other worker   0.292 |
## senior        -0.667 |
## student        3.732 |
## workman        0.217 |
## Not.sportsman -0.518 |
## 
## Supplementary categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## sex           | 0.041 0.019 0.003 |
## SPC           | 0.029 0.044 0.066 |
## Sport         | 0.001 0.000 0.001 |
## age_Q         | 0.008 0.066 0.103 |

Introduction to Open Data Science - Week 6, Analysis of Longitudinal Data

The MABS Chapter 8 Analysis with the RATS data.

Introduction

The RATS dataset measures the body weight of three groups of rats at various points in time. The groups are differentiated by the diets they were fed, and weight is measured in grams. Below the reader can find tables of the RATS dataset in wide and long form. The difference between these two forms is covered in the data-wrangling exercise.
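The conversion from wide to long form might look like this (a sketch; `RATS` is an assumed name for the wide data read in during the wrangling exercise):

```r
library(dplyr)
library(tidyr)

# Gather the WD1..WD64 measurement columns into one Weight column
# and extract the measurement day as an integer Time variable.
RATSL <- RATS %>%
  mutate(ID = factor(ID), Group = factor(Group)) %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substring(WD, 3)))  # "WD8" -> 8
```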

##     X ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1   1  1     1 240 250  255  260  262  258  266  266  265  272  278
## 2   2  2     1 225 230  230  232  240  240  243  244  238  247  245
## 3   3  3     1 245 250  250  255  262  265  267  267  264  268  269
## 4   4  4     1 260 255  255  265  265  268  270  272  274  273  275
## 5   5  5     1 255 260  255  270  270  273  274  273  276  278  280
## 6   6  6     1 260 265  270  275  275  277  278  278  284  279  281
## 7   7  7     1 275 275  260  270  273  274  276  271  282  281  284
## 8   8  8     1 245 255  260  268  270  265  265  267  273  274  278
## 9   9  9     2 410 415  425  428  438  443  442  446  456  468  478
## 10 10 10     2 405 420  430  440  448  460  458  464  475  484  496
## 11 11 11     2 445 445  450  452  455  455  451  450  462  466  472
## 12 12 12     2 555 560  565  580  590  597  595  595  612  618  628
## 13 13 13     3 470 465  475  485  487  493  493  504  507  518  525
## 14 14 14     3 535 525  530  533  535  540  525  530  543  544  559
## 15 15 15     3 520 525  530  540  543  546  538  544  553  555  548
## 16 16 16     3 510 510  520  515  530  538  535  542  550  553  569
##     X ID Group  WD Weight Time
## 1   1  1     1 WD1    240    1
## 2   2  2     1 WD1    225    1
## 3   3  3     1 WD1    245    1
## 4   4  4     1 WD1    260    1
## 5   5  5     1 WD1    255    1
## 6   6  6     1 WD1    260    1
## 7   7  7     1 WD1    275    1
## 8   8  8     1 WD1    245    1
## 9   9  9     2 WD1    410    1
## 10 10 10     2 WD1    405    1
## 11 11 11     2 WD1    445    1
## 12 12 12     2 WD1    555    1
## 13 13 13     3 WD1    470    1
## 14 14 14     3 WD1    535    1
## 15 15 15     3 WD1    520    1
## 16 16 16     3 WD1    510    1
## 17 17  1     1 WD8    250    8
## 18 18  2     1 WD8    230    8
## 19 19  3     1 WD8    250    8
## 20 20  4     1 WD8    255    8
## 21 21  5     1 WD8    260    8
## 22 22  6     1 WD8    265    8
## 23 23  7     1 WD8    275    8
## 24 24  8     1 WD8    255    8
## 25 25  9     2 WD8    415    8
## 26 26 10     2 WD8    420    8
## 27 27 11     2 WD8    445    8
## 28 28 12     2 WD8    560    8
## 29 29 13     3 WD8    465    8
## 30 30 14     3 WD8    525    8

The Analysis

To initiate the exploration of the data, we can run it through some graphical tools. As MABS chapter 8 notes, these graphical approaches, together with the summary measure approach, provide a quick and dirty way of analysing the data.
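The individual growth profiles can be drawn roughly as follows (a sketch, assuming the long-form data is in `RATSL`):

```r
library(ggplot2)

# One line per rat, with line type distinguishing the diet groups.
ggplot(RATSL, aes(x = Time, y = Weight, group = ID, linetype = Group)) +
  geom_line() +
  scale_y_continuous(name = "Weight (grams)")
```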

The groups showcase the “tracking” effect, whereby larger starting values predict larger ending values. This effect is probabilistic: a larger starting value does not determine a larger ending value, it merely makes it more probable. As such, larger starting values do not always translate into larger ending values - but most often they do.

To really bring out the effects of “tracking,” we need to standardize the data. Understanding tracking is important, because otherwise we are unable to separate the effects of the diet from the effects of the starting size.
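One way to standardize, sketched here under the assumption that the long-form data is in `RATSL`, is to center and scale the weights within each measurement time:

```r
library(dplyr)

# Within each time point: subtract the mean and divide by the
# standard deviation, so every time point is on a comparable scale.
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
```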

The above graphs showcase the effects of both tracking and time on the dataset. In standardized form, we can see that the varying diets have little effect on the weights of the rats. This means that starting size and growth caused by aging explain much of the change. In fact, only group 2 (if we ignore the one fully straight line) shows compelling signs of growth brought on by the diet alone. Groups 1 and 3 seem to follow the standard pattern.

Moving on to the summary graph, we summarize the data through the means of each group.
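The mean profiles with standard errors might be computed and plotted like this (a sketch, again assuming `RATSL`):

```r
library(dplyr)
library(ggplot2)

# Mean weight and its standard error per group and time point.
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight),
            se = sd(Weight) / sqrt(n()),
            .groups = "drop")

ggplot(RATSS, aes(x = Time, y = mean, linetype = Group)) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.8) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
```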

As we are interested in the general, or mean, effect of the diets on the rats, the above plot provides plenty of information. It shows that group 2 has the fastest growth in weight of the three groups. Group 3 has the second fastest, while group 1 remains almost stagnant.

This is not, though, the end of the analysis. We need to figure out whether certain observations are outliers and consequently contribute excessively to the results. A box plot is a good way to determine this.
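One way to look for outliers, sketched under the same assumptions, is to box-plot each rat's mean weight by group:

```r
library(dplyr)
library(ggplot2)

# Mean weight over the whole measurement period, one row per rat.
RATSM <- RATSL %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight), .groups = "drop")

ggplot(RATSM, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3)
```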

As is clear from the data, we can observe an outlier in each group. In groups 1 and 3, the outlier pulls the average down, whereas in group 2 it raises it. There are certain situations where this might be problematic - such as when measuring the weights of the rats at specific points in time - but if we are interested in growth rates (as is the case here), these outliers do not seem to have a significant effect, because their deviation from the group is stable throughout the time frame. As such, the outliers do not contribute to the difference in the change of outcome between groups.

In the book, the summary measure applied to the data is the difference in the means of the groups. This is a good approach for the BPRS data, because it is supposed to track the effectiveness of a treatment. As the RATS data measures the effect of a diet on the growth of rats, we would be better served by using the regression coefficient as the summary measure - as per MABS chapter 8, page 162. Nevertheless, this comes so close to the work done in chapter 9 that, despite the poorer fit of the mean summary measure, we will apply it here to the RATS data. As such, we will treat the diet as a one-off measure. Conceptually, this approach could be justified - for example - when measuring the effects of the diet of adolescent rats on their development into mature, fully grown rats. In such a case, the effects are cemented in - as if we were talking about the effects of a treatment. Nevertheless, there is no reason to ignore the baseline in this case, and as such that part will not be carried out.

What the above graph shows is that the variance in the results of groups 1 and 3 is negligible, while group 2’s variance is somewhat larger. Also, as noted above, each group has an outlier, but only group 2’s outlier clearly raises its group mean. Accordingly, as we are examining the means, the second graph shows the groups with that clear outlier removed. We can see that after the removal of the outlier, the groups have very similar profiles in relation to their means; only the level changes, with group 1 having the smallest mean, group 3 the largest, and group 2 in between. With knowledge of tracking, these results do not really tell us much about the diet - they simply indicate that the rats were of different sizes to start with. What we can do additionally is plot the average change in weight for each group from the start of the measurement to its end.

As is clear from these graphs, each group experiences growth, but as the second graph shows, only group 2 experiences above-average growth, with group 1 following the standardized growth and group 3, in fact, falling behind. This could - theoretically - be due to the diets. To get more formal statistical proof of this relation, we can run the data through a t-test.
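The t-tests below compare, within each diet group, the weights at the first and last measurement. A sketch of how they might be run (assuming `RATSL`; the time points 1 and 64 come from the data above):

```r
# Two-sample t-test of first vs. last measurement, per group
# (var.equal = TRUE gives the classic two-sample t-test).
for (g in unique(RATSL$Group)) {
  endpoints <- subset(RATSL, Group == g & Time %in% c(1, 64))
  print(t.test(Weight ~ Time, data = endpoints, var.equal = TRUE))
}
```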

##    stdweight          Weight           Time      Group
##  Min.   :0.3105   Min.   :405.0   Min.   : 1.0   1:0  
##  1st Qu.:0.3844   1st Qu.:418.8   1st Qu.: 1.0   2:6  
##  Median :0.5083   Median :458.5   Median :32.5   3:0  
##  Mean   :0.4941   Mean   :451.0   Mean   :32.5        
##  3rd Qu.:0.6038   3rd Qu.:476.5   3rd Qu.:64.0        
##  Max.   :0.6587   Max.   :496.0   Max.   :64.0
## 
##  Two Sample t-test
## 
## data:  Weight by Time
## t = -3.3271, df = 14, p-value = 0.004986
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -38.032466  -8.217534
## sample estimates:
##  mean in group 1 mean in group 64 
##          250.625          273.750
## 
##  Two Sample t-test
## 
## data:  Weight by Time
## t = -4.275, df = 4, p-value = 0.0129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -102.26643  -21.73357
## sample estimates:
##  mean in group 1 mean in group 64 
##              420              482
## 
##  Two Sample t-test
## 
## data:  Weight by Time
## t = -2.4693, df = 6, p-value = 0.0485
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -82.6240161  -0.3759839
## sample estimates:
##  mean in group 1 mean in group 64 
##           508.75           550.25

The t-test analysis indicates rather clearly that all differences in mean outcome are statistically significant. The clearly higher statistical significance of the group 1 results is explained by the larger number of observations, which makes the detection of true differences more reliable.

The incorporation of baseline measures is not done in this analysis, as the focus is on the change in weight - as opposed to the actual weights. The former is not affected by the higher starting weights. Furthermore, as there are no missing values, the tricks used in MABS Chapter 8 are not necessary here.

The MABS Chapter 9 Analysis with the BPRS data.

Introduction

The BPRS dataset covers 40 male subjects who were randomly assigned to one of two treatment groups for psychiatric issues. Each subject was rated on the brief psychiatric rating scale (BPRS), which is used to evaluate patients suspected of having schizophrenia. The BPRS assesses the level of 18 psychiatric symptoms (inter alia: hostility, suspiciousness, hallucinations, and grandiosity), each rated from one (not present) to seven (extremely severe). These BPRS measures were taken before treatment began (week 0) and then weekly for eight weeks. Below we begin with tables of the BPRS dataset in wide and long form:

##     X treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1   1         1       1    42    36    36    43    41    40    38    47    51
## 2   2         1       2    58    68    61    55    43    34    28    28    28
## 3   3         1       3    54    55    41    38    43    28    29    25    24
## 4   4         1       4    55    77    49    54    56    50    47    42    46
## 5   5         1       5    72    75    72    65    50    39    32    38    32
## 6   6         1       6    48    43    41    38    36    29    33    27    25
## 7   7         1       7    71    61    47    30    27    40    30    31    31
## 8   8         1       8    30    36    38    38    31    26    26    25    24
## 9   9         1       9    41    43    39    35    28    22    20    23    21
## 10 10         1      10    57    51    51    55    53    43    43    39    32
## 11 11         1      11    30    34    34    41    36    36    38    36    36
## 12 12         1      12    55    52    49    54    48    43    37    36    31
## 13 13         1      13    36    32    36    31    25    25    21    19    22
## 14 14         1      14    38    35    36    34    25    27    25    26    26
## 15 15         1      15    66    68    65    49    36    32    27    30    37
## 16 16         1      16    41    35    45    42    31    31    29    26    30
## 17 17         1      17    45    38    46    38    40    33    27    31    27
## 18 18         1      18    39    35    27    25    29    28    21    25    20
## 19 19         1      19    24    28    31    28    29    21    22    23    22
## 20 20         1      20    38    34    27    25    25    27    21    19    21
## 21 21         2       1    52    73    42    41    39    38    43    62    50
## 22 22         2       2    30    23    32    24    20    20    19    18    20
## 23 23         2       3    65    31    33    28    22    25    24    31    32
## 24 24         2       4    37    31    27    31    31    26    24    26    23
## 25 25         2       5    59    67    58    61    49    38    37    36    35
## 26 26         2       6    30    33    37    33    28    26    27    23    21
## 27 27         2       7    69    52    41    33    34    37    37    38    35
## 28 28         2       8    62    54    49    39    55    51    55    59    66
## 29 29         2       9    38    40    38    27    31    24    22    21    21
## 30 30         2      10    65    44    31    34    39    34    41    42    39
## 31 31         2      11    78    95    75    76    66    64    64    60    75
## 32 32         2      12    38    41    36    27    29    27    21    22    23
## 33 33         2      13    63    65    60    53    52    32    37    52    28
## 34 34         2      14    40    37    31    38    35    30    33    30    27
## 35 35         2      15    40    36    55    55    42    30    26    30    37
## 36 36         2      16    54    45    35    27    25    22    22    22    22
## 37 37         2      17    33    41    30    32    46    43    43    43    43
## 38 38         2      18    28    30    29    33    30    26    36    33    30
## 39 39         2      19    52    43    26    27    24    32    21    21    21
## 40 40         2      20    47    36    32    29    25    23    23    23    23
##     X treatment subject weeks bprs week
## 1   1         1       1 week0   42    0
## 2   2         1       2 week0   58    0
## 3   3         1       3 week0   54    0
## 4   4         1       4 week0   55    0
## 5   5         1       5 week0   72    0
## 6   6         1       6 week0   48    0
## 7   7         1       7 week0   71    0
## 8   8         1       8 week0   30    0
## 9   9         1       9 week0   41    0
## 10 10         1      10 week0   57    0
## 11 11         1      11 week0   30    0
## 12 12         1      12 week0   55    0
## 13 13         1      13 week0   36    0
## 14 14         1      14 week0   38    0
## 15 15         1      15 week0   66    0
## 16 16         1      16 week0   41    0
## 17 17         1      17 week0   45    0
## 18 18         1      18 week0   39    0
## 19 19         1      19 week0   24    0
## 20 20         1      20 week0   38    0
## 21 21         2       1 week0   52    0
## 22 22         2       2 week0   30    0
## 23 23         2       3 week0   65    0
## 24 24         2       4 week0   37    0
## 25 25         2       5 week0   59    0
## 26 26         2       6 week0   30    0
## 27 27         2       7 week0   69    0
## 28 28         2       8 week0   62    0
## 29 29         2       9 week0   38    0
## 30 30         2      10 week0   65    0
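The wide-to-long conversion shown above can be sketched as follows. This uses `tidyr::pivot_longer` (the course scripts may have used the older `gather()`); the two-row `BPRS` frame here is an illustration with the same column layout as the real data.

```r
library(dplyr)
library(tidyr)

# Two illustrative rows in the wide format used above.
BPRS <- data.frame(
  treatment = factor(c(1, 2)),
  subject   = factor(c(1, 1)),
  week0 = c(42, 52), week1 = c(36, 73), week2 = c(36, 42)
)

# Collect the weekN columns into key/value pairs and extract the
# week number as an integer, as in the long table above.
BP_Long <- BPRS %>%
  pivot_longer(starts_with("week"),
               names_to = "weeks", values_to = "bprs") %>%
  mutate(week = as.integer(sub("week", "", weeks))) %>%
  arrange(week, treatment)
```

Each wide row thus becomes one row per week in the long form.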

The Analysis

When possible, model fitting provides a more rigorous tool for analysing data than the summary methods discussed above. Nevertheless, with longitudinal data, the key assumption of basic linear models - that observations be independent of one another - often does not hold. We will begin by showcasing this graphically. Instead of examining the data through a plot similar to the "Plot of weight against time for rat data" in chapter 9 of MABS, the progression of the BPRS scores is represented here as two line graphs, one for each treatment group. I find this much more informative than the slightly confusing numbers chart referred to above.
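A sketch of such a per-treatment line plot is below; `BP_Long` here is a small illustrative stand-in for the long-form data (made-up values), and the exact aesthetics used in the course script may differ.

```r
library(ggplot2)

# Illustrative long-form slice: 2 subjects per treatment, 3 weeks.
BP_Long <- data.frame(
  treatment = factor(rep(1:2, each = 6)),
  subject   = factor(rep(1:2, each = 3, times = 2)),
  week      = rep(0:2, times = 4),
  bprs      = c(42, 36, 36, 58, 68, 61, 52, 73, 42, 30, 23, 32)
)

# One line per subject, one panel per treatment group.
p <- ggplot(BP_Long, aes(x = week, y = bprs,
                         group = interaction(subject, treatment),
                         colour = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  labs(x = "Week", y = "BPRS score")
```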

## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BP_Long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The above graphical representation would seem to provide evidence that the observations depend on the subject's previous scores. They are not completely random but, to an extent, trace a trend. This is especially clear in the set of graphs, which showcases how the results of each week are correlated with the results of the previous weeks.

Additionally, both the graphical representation and the basic linear model seem to indicate that treatment 1 produces better results for the BPRS scores of the participants. Nevertheless, the graphs are not completely clear due to the large number of subjects, and the basic linear model applied above ignores the repeated-measures structure of the data. As such, it ignores the linkage of observations belonging to single subjects. More nuanced models can be applied. Below, one can find a gradual progression towards a more complex and better-fitting linear model.

The first summary comes from the Random Intercept Model, which allows us to abandon the assumption of independence between all observations and link the observations belonging to one subject. As such, it permits the linear regression fit of each subject to differ in intercept from those of other subjects.

The next summary is from the Random Intercept and Random Slope Model, which not only allows each subject their own intercept, but also their own slope. This in turn makes it possible to incorporate the differing progressions of the subjects and the effect of time.

Finally, the third summary is from the Random Intercept and Random Slope Model with a week-by-treatment interaction.
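The three fits described above can be sketched with `lme4::lmer` as follows (the course material uses lme4; the simulated `BP_Long` below is an illustrative stand-in for the real long-form data, and the model object names `BP_ri`, `BP_rirs`, `BP_rirsX` mirror the ANOVA output further down):

```r
library(lme4)

set.seed(42)
# Illustrative stand-in: 20 subjects per treatment, weeks 0-8,
# with a downward trend plus subject-level shifts and noise.
BP_Long <- expand.grid(subject = factor(1:20),
                       treatment = factor(1:2),
                       week = 0:8)
BP_Long$bprs <- 50 - 2 * BP_Long$week +
  rep(rnorm(40, sd = 6), times = 9) +   # per-subject shifts
  rnorm(nrow(BP_Long), sd = 8)          # measurement noise

# Random Intercept Model: subjects differ in intercept only.
BP_ri <- lmer(bprs ~ week + treatment + (1 | subject),
              data = BP_Long, REML = FALSE)

# Random Intercept and Random Slope Model: subjects also differ
# in their rate of change over the weeks.
BP_rirs <- lmer(bprs ~ week + treatment + (week | subject),
                data = BP_Long, REML = FALSE)

# Adding the week-by-treatment interaction.
BP_rirsX <- lmer(bprs ~ week * treatment + (week | subject),
                 data = BP_Long, REML = FALSE)

# Likelihood-ratio comparisons of nested models, as in the
# ANOVA tables below.
anova(BP_ri, BP_rirs)
anova(BP_rirs, BP_rirsX)
```

With `REML = FALSE` the models are fitted by maximum likelihood, which is what makes the likelihood-ratio comparisons between nested fixed-effects structures valid.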

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BP_Long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BP_Long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BP_Long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

As we are interested in the model that provides the best representation of the data, we will use ANOVA and examine the resulting p-values of the chi-squared test. The smaller the p-value, the stronger the evidence that the more complex model improves on the model it is compared against. The results of the ANOVAs can be found below:

## Data: BP_Long
## Models:
## BP_ri: bprs ~ week + treatment + (1 | subject)
## BP_rirs: bprs ~ week + treatment + (week | subject)
##         npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BP_ri      5 2748.7 2768.1 -1369.4   2738.7                       
## BP_rirs    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: BP_Long
## Models:
## BP_rirs: bprs ~ week + treatment + (week | subject)
## BP_rirsX: bprs ~ week * treatment + (week | subject)
##          npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BP_rirs     7 2745.4 2772.6 -1365.7   2731.4                       
## BP_rirsX    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: BP_Long
## Models:
## BP_ri: bprs ~ week + treatment + (1 | subject)
## BP_rirsX: bprs ~ week * treatment + (week | subject)
##          npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BP_ri       5 2748.7 2768.1 -1369.4   2738.7                       
## BP_rirsX    8 2744.3 2775.4 -1364.1   2728.3 10.443  3    0.01515 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The smallest p-value (0.015) is obtained when comparing the basic Random Intercept Model with the Random Intercept and Random Slope Model that allows for week-by-treatment interaction. In fact, each move towards a more complicated model produces an improvement in fit. The improvement of the interaction model over the general Random Intercept and Random Slope Model is not statistically significant at the 5 % level (p = 0.075), but the interaction model is still the preferred one, as it provides the best fit overall, with the highest chi-squared score and the lowest p-value against the Random Intercept Model. As such, going forward, we will apply this model in our analysis.

In the graphical representation below we can see how well our fitted model matches the observed BPRS scores. The colours indicate the subjects within each treatment group.

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BP_Long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
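A sketch of how the fitted-values plot referred to here can be produced: extract the model's fitted values with `fitted()` and draw them in the same layout as the observed scores. The simulated `BP_Long` below is again an illustrative stand-in for the real data, not the course dataset.

```r
library(lme4)
library(ggplot2)

set.seed(7)
# Illustrative stand-in for the long-form BPRS data.
BP_Long <- expand.grid(subject = factor(1:20),
                       treatment = factor(1:2), week = 0:8)
BP_Long$bprs <- 50 - 2 * BP_Long$week +
  rep(rnorm(40, sd = 6), times = 9) + rnorm(nrow(BP_Long), sd = 8)

# The interaction model selected above.
BP_rirsX <- lmer(bprs ~ week * treatment + (week | subject),
                 data = BP_Long, REML = FALSE)

# One fitted value per observation, attached back to the data.
BP_Long$Fitted <- fitted(BP_rirsX)

# Fitted trajectories: one line per subject, one panel per treatment.
p_fit <- ggplot(BP_Long, aes(week, Fitted,
                             group = interaction(subject, treatment),
                             colour = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  labs(x = "Week", y = "Fitted BPRS score")
```

Plotting the observed `bprs` side by side with `Fitted` is what allows the visual assessment of model fit discussed below.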

As can be seen from the above graphical representation, the fitted model provides a decent fit, but not a perfect one. What does become clear from it is that both treatments seem to produce an effect on the BPRS scores. The general trend is a downward one, with treatment 1 producing slightly lower scores. As the above repetition of the summary shows us, the t-value for the effect of time (week) is large enough to indicate statistical significance. The correlation of the fixed effects indicates a moderately strong negative correlation.

In the case of the treatment type, the fixed-effects summary gives us an intermediate negative estimate moving from treatment 1 to treatment 2. Nevertheless, the t-value does not provide sufficient evidence of statistical significance. As such, the null hypothesis - that treatment type has no effect - cannot be rejected here.